Fast algorithm for detecting community structure in networks 






M. E. J. Newman 
Department of Physics and Center for the Study of Complex Systems, 
University of Michigan, Ann Arbor, MI 48109-1120 

It has been found that many networks display community structure — groups of vertices within 
which connections are dense but between which they are sparser — and highly sensitive computer 
algorithms have in recent years been developed for detecting such structure. These algorithms 
however are computationally demanding, which limits their application to small networks. Here we 
describe a new algorithm which gives excellent results when tested on both computer-generated and 
real-world networks and is much faster, typically thousands of times faster than previous algorithms. 
We give several example applications, including one to a collaboration network of more than 50 000 
physicists. 

I. INTRODUCTION

There has in recent years been a surge of interest 
within the physics community in the properties of net- 
works of many kinds, including the Internet, the world 
wide web, citation networks, transportation networks, 
software call graphs, email networks, food webs, and so- 
cial and biochemical networks [1, 2, 3, 4]. One property that
has attracted particular attention is that of
"community structure": the vertices in networks are of- 
ten found to cluster into tightly-knit groups with a high 
density of within-group edges and a lower density of 
between-group edges. Girvan and Newman [5] pro-
posed a computer algorithm based on the iterative re-
moval of edges with high "betweenness" scores that ap- 
pears to identify such structure with some sensitivity, 
and this algorithm has been employed by a number of 
authors in the study of such diverse systems as networks 
of email messages, social networks of animals, collabo- 
rations of jazz musicians, metabolic networks, and gene 
networks. As pointed out by Newman and Girvan,
the principal disadvantage of their
algorithm is the high computational demand it makes.
In its simplest and fastest form it runs in worst-case time
$O(m^2 n)$ on a network with $m$ edges and $n$ vertices, or
$O(n^3)$ on a sparse graph (one for which $m$ scales with $n$ in
the limit of large $n$, which covers essentially all networks
of current scientific interest, with the possible exception
of food webs). With typical computer resources available
at the time of writing, this limits the algorithm's use to 
networks of a few thousand vertices at most, and sub- 
stantially less than this for interactive applications. In- 
creasingly however, there is interest in the study of much 
larger networks; citation and collaboration networks can 
contain millions of vertices [13], for example, while the
world wide web numbers in the billions [14].

In this paper, therefore, we propose a new algorithm
for detecting community structure. The algorithm oper-
ates on different principles from that of Girvan and New-
man (GN) but, as we will show, gives qualitatively similar
results. The worst-case running time of the algorithm is
$O((m + n)n)$, or $O(n^2)$ on a sparse graph. In practice,
it runs to completion on current computers in reason-
able times for networks of up to a million or so vertices,
bringing within reach the study of communities in many
systems that would previously have been considered in-
tractable.



II. THE ALGORITHM 

Our algorithm is based on the idea of modularity. 
Given any network, the GN community structure algo- 
rithm always produces some division of the vertices into 
communities, regardless of whether the network has any 
natural such division. To test whether a particular divi-
sion is meaningful we define a quality function or "mod-
ularity" $Q$ as follows [6]. Let $e_{ij}$ be the fraction of edges
in the network that connect vertices in group $i$ to those
in group $j$, and let $a_i = \sum_j e_{ij}$. Then

$$Q = \sum_i \left( e_{ii} - a_i^2 \right)$$

is the fraction of edges that fall within communities, mi-
nus the expected value of the same quantity if edges fall
at random without regard for the community structure. 
If a particular division gives no more within-community 
edges than would be expected by random chance we will 
get $Q = 0$. Values other than 0 indicate deviations from
randomness, and in practice values greater than about
0.3 appear to indicate significant community structure.
A number of examples are given in Ref. [6].
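
As a concrete illustration of the modularity just defined, the following minimal sketch evaluates $Q = \sum_i (e_{ii} - a_i^2)$ for a given division of an undirected, unweighted graph. The function name `modularity` and the input format are assumptions made for this example. It also assumes the convention that each edge between groups $i$ and $j$ contributes half its weight to $e_{ij}$ and half to $e_{ji}$, so that all the $e_{ij}$ sum to 1.

```python
from collections import defaultdict


def modularity(edges, community):
    """Return Q = sum_i (e_ii - a_i^2) for one division of a graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to its community label.
    """
    m = len(edges)
    e_within = defaultdict(float)  # e_ii: fraction of edges inside community i
    a = defaultdict(float)         # a_i: fraction of edge ends attached to i
    for u, v in edges:
        cu, cv = community[u], community[v]
        # Each edge has two ends; each end carries weight 1/(2m).
        a[cu] += 1.0 / (2 * m)
        a[cv] += 1.0 / (2 * m)
        if cu == cv:
            e_within[cu] += 1.0 / m
    return sum(e_within[c] - a[c] ** 2 for c in a)


# Two triangles joined by a single edge (hypothetical test graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, split))  # 5/14, about 0.357
```

Placing all six vertices in a single community gives $e_{11} = a_1 = 1$ and hence $Q = 0$, in line with the statement that a division no better than chance scores zero.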

But this now suggests an alternative approach to find- 
ing community structure. If a high value of Q represents 
a good community division, why not simply optimize Q 
over all possible divisions to find the best one? By do- 
ing this, we can avoid the iterative removal of edges and 
cut straight to the chase. The problem is that true op-
timization of $Q$ is very costly. The number of ways to
divide $n$ vertices into $g$ non-empty groups is given by
the Stirling number of the second kind $S_n^{(g)}$, and hence
the number of distinct community divisions is $\sum_{g=1}^{n} S_n^{(g)}$.
This sum is not known in closed form, but we observe
that $S_n^{(1)} + S_n^{(2)} = 2^{n-1}$ for all $n > 1$, so that the sum
must increase at least exponentially in $n$. To carry out
an exhaustive search of all possible divisions for the op-
timal value of $Q$ would therefore take at least an expo-
nential amount of time, and is in practice infeasible for
systems larger than twenty or thirty vertices. Various
approximate optimization methods are available: simu- 
lated annealing, genetic algorithms, and so forth. Here 
we consider a scheme based on a standard "greedy" op- 
timization algorithm, which appears to perform well. 
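
To give a feel for how quickly the number of possible divisions grows, the short sketch below tabulates Stirling numbers of the second kind using the standard recurrence $S(k, g) = g\,S(k-1, g) + S(k-1, g-1)$ and sums them over $g$. The helper name `stirling2_row` is an assumption made for this example.

```python
def stirling2_row(n):
    """Return [S(n,0), ..., S(n,n)], Stirling numbers of the second kind."""
    row = [1]  # S(0,0) = 1
    for k in range(1, n + 1):
        new = [0] * (k + 1)
        for g in range(1, k + 1):
            # S(k,g) = g*S(k-1,g) + S(k-1,g-1); S(k-1,k) is zero.
            prev_g = row[g] if g < len(row) else 0
            new[g] = g * prev_g + row[g - 1]
        row = new
    return row


for n in (5, 10, 20, 30):
    row = stirling2_row(n)
    # The first two terms alone already add up to 2**(n-1).
    assert row[1] + row[2] == 2 ** (n - 1)
    print(n, sum(row))
```

For $n = 10$ the total is already 115975 divisions, which makes clear why exhaustive search is abandoned in favour of a heuristic.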

Our algorithm falls in the general category of agglom-
erative hierarchical clustering methods [15, 16]. Starting
with a state in which each vertex is the sole member of 
one of n communities, we repeatedly join communities 
together in pairs, choosing at each step the join that re- 
sults in the greatest increase (or smallest decrease) in Q. 
The progress of the algorithm can be represented as a 
"dendrogram," a tree that shows the order of the joins 
(see Fig. 2 for an example). Cuts through this dendro-
gram at different levels give divisions of the network into
larger or smaller numbers of communities and, as with 
the GN algorithm, we can select the best cut by looking 
for the maximal value of Q. 

Since the joining of a pair of communities between 
which there are no edges at all can never result in an 
increase in Q, we need only consider those pairs between 
which there are edges, of which there will at any time 
be at most m, where m is again the number of edges in 
the graph. The change in $Q$ upon joining two communi-
ties is given by $\Delta Q = e_{ij} + e_{ji} - 2 a_i a_j = 2(e_{ij} - a_i a_j)$,
which can clearly be calculated in constant time. Fol-
lowing a join, some of the matrix elements $e_{ij}$ must be
updated by adding together the rows and columns corre-
sponding to the joined communities, which takes worst-
case time $O(n)$. Thus each step of the algorithm takes
worst-case time $O(m + n)$. There are a maximum of
$n - 1$ join operations necessary to construct the complete
dendrogram and hence the entire algorithm runs in time
$O((m + n)n)$, or $O(n^2)$ on a sparse graph. The algorithm
has the added advantage of calculating the value of Q 
as it goes along, making it especially simple to find the 
optimal community structure. 
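
The procedure described in this section can be sketched in a few dozen lines. The version below is an illustrative reading of the text rather than the author's own code: the function name `greedy_modularity`, the sparse dictionary storage, and the convention that each edge contributes $1/(2m)$ to both $e_{ij}$ and $e_{ji}$ are all assumptions made here. At each step it scans the connected pairs of communities, joins the pair with the largest $\Delta Q = 2(e_{ij} - a_i a_j)$, and remembers the division with the highest $Q$ seen so far.

```python
def greedy_modularity(n, edges):
    """Greedy agglomeration on vertices 0..n-1 (simple graph, no self-loops).

    Returns (best_q, best_partition), with best_partition a list of sets.
    """
    m = len(edges)
    w = 1.0 / (2 * m)
    e = {i: {} for i in range(n)}   # e[i][j] for connected pairs, i != j
    a = {i: 0.0 for i in range(n)}  # a[i] = sum_j e_ij
    for u, v in edges:
        e[u][v] = e[u].get(v, 0.0) + w
        e[v][u] = e[v].get(u, 0.0) + w
        a[u] += w
        a[v] += w
    members = {i: {i} for i in range(n)}
    q = -sum(x * x for x in a.values())  # singletons: every e_ii is zero
    best_q, best = q, [set(s) for s in members.values()]
    while True:
        # Find the connected pair whose join changes Q the most.
        choice = None
        for i in e:
            for j, eij in e[i].items():
                if i < j:
                    dq = 2.0 * (eij - a[i] * a[j])
                    if choice is None or dq > choice[0]:
                        choice = (dq, i, j)
        if choice is None:
            break  # no connected pairs left to join
        dq, i, j = choice
        # Merge community j into i by adding its row and column to i's.
        for k, ejk in e[j].items():
            if k == i:
                continue
            e[i][k] = e[i].get(k, 0.0) + ejk
            e[k][i] = e[i][k]
            del e[k][j]
        del e[i][j]
        del e[j]
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
        q += dq
        if q > best_q:
            best_q, best = q, [set(s) for s in members.values()]
    return best_q, best


# Two triangles joined by one edge (hypothetical test graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(greedy_modularity(6, edges))
```

On this small graph the joins proceed until the two triangles remain, at which point $Q = 5/14$; the final join of the two triangles lowers $Q$ to zero, so the two-community division is the one returned. The pair scan touches at most $m$ pairs and the merge at most $n$ entries, matching the $O(m + n)$ cost per step quoted in the text.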

It is worth noting that our algorithm can be general- 
ized trivially to weighted networks in which each edge 
has a numeric strength associated with it, by making the 
initial values of the matrix elements $e_{ij}$ equal to those
strengths, rather than just zero or one; otherwise the 
algorithm is as above and has the same running time. 
The networks studied in this paper however are all un- 
weighted. 
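
For the weighted case, the only change is in how the starting matrix is filled. The sketch below builds the initial $e_{ij}$ and $a_i$ from a list of weighted edges. The function name `initial_state` is hypothetical, and dividing by the total strength (so that the entries of $e$ sum to 1, as in the unweighted case) is an assumption of this example rather than something stated in the text.

```python
def initial_state(weighted_edges):
    """Build the starting e_ij and a_i for a weighted, undirected graph.

    weighted_edges: list of (u, v, strength) with u != v and strength > 0.
    """
    total = sum(w for _, _, w in weighted_edges)
    e, a = {}, {}
    for u, v, w in weighted_edges:
        half = w / (2.0 * total)  # each edge end carries half the strength
        for x, y in ((u, v), (v, u)):
            e.setdefault(x, {})
            e[x][y] = e[x].get(y, 0.0) + half
            a[x] = a.get(x, 0.0) + half
    return e, a


e, a = initial_state([(0, 1, 2.0), (1, 2, 1.0)])
print(e[0][1], a[1])
```

With all strengths equal to one this reduces to the unweighted initialisation, and the joining steps are unchanged.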



III. APPLICATIONS 

As a first example of the working of our algorithm, we
have generated using a computer a large number of ran-
dom graphs with known community structure, which we
then run through the algorithm to quantify its perfor-
mance. Each graph consists of $n = 128$ vertices divided
into four groups of 32. Each vertex has on average $z_{\rm in}$
edges connecting it to members of the same group and
$z_{\rm out}$ edges to members of other groups, with $z_{\rm in}$ and $z_{\rm out}$
chosen such that the total expected degree $z_{\rm in} + z_{\rm out} = 16$,
in this case. As $z_{\rm out}$ is increased from small values, the
resulting graphs pose greater and greater challenges to
the community-finding algorithm. In Fig. 1 we show the
fraction of vertices correctly assigned to the four commu-
nities by the algorithm as a function of $z_{\rm out}$. As the figure
shows, the algorithm performs well, correctly identifying
more than 90% of vertices for values of $z_{\rm out} \le 6$. Only
when $z_{\rm out}$ approaches the value 8 at which the number of
within- and between-community edges per vertex is the
same does the algorithm begin to fail. On the same plot
we also show the performance of the GN algorithm and,
as we can see, that algorithm performs slightly but mea-
surably better for smaller values of $z_{\rm out}$. For example,
for $z_{\rm out} = 5$ our new algorithm correctly identifies an av-
erage of 97.4(2)% of vertices, while the older algorithm
correctly identifies 98.9(1)%. Both, however, clearly per-
form well.

FIG. 1: The fraction of vertices correctly identified by our al-
gorithms in the computer-generated graphs described in the
text. The two curves show results for the new algorithm
(circles) and for the algorithm of Girvan and Newman [5]
(squares). Each point is an average over 100 graphs.
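
Test graphs of this kind are straightforward to generate. The sketch below is one way to do it, under the assumption (not spelled out in the text) that every pair of vertices is connected independently, with probability $z_{\rm in}/31$ inside a group and $z_{\rm out}/96$ between groups, which yields the stated expected degrees. The function name `planted_partition` and its parameters are hypothetical.

```python
import random


def planted_partition(z_out, z_total=16, groups=4, size=32, seed=None):
    """Random graph with known communities; returns (edges, group).

    group[v] is the planted community of vertex v.
    """
    rng = random.Random(seed)
    n = groups * size
    z_in = z_total - z_out
    p_in = z_in / (size - 1)   # 31 possible partners in the same group
    p_out = z_out / (n - size)  # 96 possible partners in other groups
    group = [v // size for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if group[u] == group[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, group


edges, group = planted_partition(z_out=6, seed=1)
print(len(group), 2 * len(edges) / len(group))  # vertices, mean degree
```

The returned `group` list provides the ground truth against which the communities found by an algorithm can be scored.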

Interestingly, for higher values of $z_{\rm out}$ our new algo-
rithm performs better than the older one, and we have
come across a few real-world networks in which this is the
case also. Normally, however, the GN algorithm seems 
to have the edge, and this should come as no great sur- 
prise. Our new algorithm bases its decisions on purely lo- 
cal information about individual communities, while the 
GN algorithm uses non-local information about the entire 
network — information derived from betweenness scores. 
Since community structure is itself fundamentally a non- 
local quantity, it seems reasonable that one can do a bet- 
ter job of finding that structure if one has non-local in- 
formation at one's disposal. 

For systems small enough that the GN algorithm is
computationally tractable, therefore, we see no reason
not to continue using it, since it appears to give the best
results. For systems too large to make use of this ap-
proach, however, our new algorithm gives useful com-
munity structure information with comparatively little
effort.

FIG. 2: Dendrogram of the communities found by our algo-
rithm in the "karate club" network of Zachary [17]. The
shapes of the vertices represent the two groups into which the
club split as the result of an internal dispute.

We have applied our algorithm to a variety of real-
world networks also. We have looked, for example, at
the "karate club" network studied in Ref. [5], which represents
friendships between 34 members of a club at a US univer-
sity, as recorded over a two-year period by Zachary [17].
During the course of the study, the club split into two 
groups as a result of a dispute within the organization, 
and the members of one group left to start their own 
club. In Fig. 2 we show the dendrogram derived by feed-
ing the friendship network into our algorithm. The peak
modularity is Q = 0.381 and corresponds to a split into 
two groups of 17, as shown in the figure. The shapes of 
the vertices represent the alignments of the club mem- 
bers following the split and, as we can see, the division 
found by the algorithm corresponds almost perfectly to 
these alignments; only one vertex, number 10, is classified 
wrongly. The GN algorithm performs similarly on this 
task, but not better — it also finds the split but classifies 
one vertex wrongly (although a different one, vertex 3). 
In other tests, we find that our algorithm also success- 
fully detects the main two-way division of the dolphin 
social network of Lusseau, and the division be-
tween black and white musicians in the jazz network of
Gleiser and Danon.

As a demonstration of how our algorithm can some- 
times miss some of the structure in a network, we take 
another example from Ref. [5], a network representing
the schedule of games between American college foot- 
ball teams in a single season. Because the teams are di- 
vided into groups or "conferences," with intra-conference 
games being more frequent than inter-conference games, 
we have a reasonable idea ahead of time about what com-
munities our algorithm should find. The dendrogram
generated by the algorithm is shown in Fig. 3 and has
an optimal modularity of $Q = 0.546$, which is a little shy
of the value 0.601 for the best split reported previously. As
the dendrogram reveals, the algorithm finds six commu- 
nities. Some of them correspond to single conferences, 
but most correspond to two or more. The GN algorithm, 
by contrast, finds all eleven conferences, as well as accu- 
rately identifying independent teams that belong to no 
conference. Nonetheless, it is clear that the new algo- 
rithm is quite capable of picking out useful community 
structure from the network, and of course it is much the 
faster algorithm. On the author's personal computer the 
algorithm ran to completion in an unmeasurably small
time — less than a hundredth of a second. The algorithm 
of Girvan and Newman took a little over a second. 

A time difference of this magnitude will not present 
a big problem in most practical situations, but perfor- 
mance rapidly becomes an issue when we look at larger 
networks; we expect the ratio of running times to in- 
crease with the number of vertices. Thus, for example, 
in applying our algorithm to the 1275-node network of 
jazz musician collaborations mentioned above, we found 
that it runs to completion in about one second of CPU 
time. The GN algorithm by contrast takes more than 
three hours to reach very similar results. 

As an example of an analysis made possible by the 
speed of the new algorithm, we have looked at a network 
of collaborations between physicists as documented by 
papers posted on the widely-used Physics E-print Archive 
at arxiv.org. The network is an updated version of the
one described in Ref. [13], in which scientists are consid-
ered connected if they have coauthored one or more pa-
pers posted on the archive. We analyze only the largest
component of the network, which contains $n = 56\,276$ sci-
entists in all branches of physics covered by the archive.
Since two vertices that are unconnected by any path are 
never put in the same community by our algorithm, the 
small fraction of vertices that are not part of the largest 
component can safely be assumed to be in separate com- 
munities in the sense of our algorithm. Our algorithm 
takes 42 minutes to find the full community structure. 
Our best estimates indicate that the GN algorithm would 
take somewhere between three and five years to complete 
its version of the same calculation. 

The analysis reveals that the network in question con-
sists of about 600 communities, with a high peak modu-
larity of $Q = 0.713$, indicating strong community struc-
ture in the physics world. Four of the communities found
are large, containing between them 77% of all the ver-
tices, while the others are small (see Fig. 4, left panel).
The four large communities correspond closely to subject 
subareas: one to astrophysics, one to high-energy physics, 
and two to condensed matter physics. Thus there ap- 
pears to be a strong correlation between the structure 
found by our algorithm and the community divisions per- 
ceived by human observers. It is precisely correlation 
of this kind that makes community structure analysis a 
useful tool in understanding the behavior of networked 
systems. 

FIG. 3: Dendrogram of the communities found in the college football network described in the text. The real-world
communities (conferences) are denoted by the different shapes as indicated in the legend.

We can repeat the analysis with any of the subcom-
munities to observe how they break up. For example,
feeding the smaller of the two condensed-matter groups
through the algorithm again, we find an even stronger
peak modularity of $Q = 0.807$, the strongest we have
yet observed in any network, corresponding to a split
into about a hundred communities of all sizes (Fig. 4,
center panel). The sizes appear roughly to have a power-
law distribution with exponent about $-2$. Narrowing
our focus still further to the particular one of these com-
munities that contains the present author, we find the
structure shown in the right panel of Fig. 4. Feeding this
one last time through the algorithm, it breaks apart into 
communities that correspond closely to individual insti- 
tutional research groups, the author's group appearing in 
the corner of the figure, highlighted by the dashed box. 
One could pursue this line of analysis further, identify- 
ing individual groups, iteratively breaking them down, 
and looking for example at the patterns of collaboration 
between them, but we leave this for later studies. 

IV. CONCLUSIONS 

In this paper we have described a new algorithm for ex-
tracting community structure from networks, which has
a considerable speed advantage over previous algorithms,
running to completion in time that scales as the square 
of the network size. This allows us to study much larger 
systems than has previously been possible. Among other 
examples, we have applied the algorithm to a network of 
collaborations between more than fifty thousand physi- 
cists, and found that the resulting community structure 
corresponds closely to the traditional divisions between 
specialties and research groups in the field. 

We believe that our method will not only allow for the 
extension of community structure analysis to some of the 
very large networks that are now being studied for the 
first time, but will also provide a useful tool for visualizing
and understanding the structure of these networks, whose
daunting size has hitherto made many of their structural 
properties obscure. 

FIG. 4: Left panel: Community structure in the collaboration network of physicists. The graph breaks down into four large
groups, each composed primarily of physicists of one specialty, as shown. Specialties are determined by the subsection(s)
of the e-print archive in which individuals post papers: "C.M." indicates condensed matter; "H.E.P." high-energy physics
including theory, phenomenology, and nuclear physics; "astro" indicates astrophysics. Middle panel: one of the condensed
matter communities is further broken down by the algorithm, revealing an approximate power-law distribution of community
sizes. Right panel: one of these smaller communities is further analyzed to reveal individual research groups (different shades),
one of which (in dashed box) is the author's own.